Variant Discovery ◾ 129
cd ..
#7- Marking/removing duplicate alignments
mkdir dupRemBam
cd sortedbam
for i in $(ls *.bam);
do
samtools rmdup ${i} ../dupRemBam/${i} 2> ../dupRemBam/${i}.log
done;
cd ..
#8- Alignment pileup and variant calling using bcftools
mkdir variants
cd dupRemBam
for i in $(ls *.bam);
do
samtools index ${i}
done;
ls *.bam | rev | rev > bam_list.txt
freebayes \
-f ../ref/GCF_009858895.2_ASM985889v3_genomic.fna \
-C 5 \
-L bam_list.txt \
-v ../variants/sarscov2.vcf
cd ..
#9- Compress and index the VCF file
bgzip variants/sarscov2.vcf
tabix variants/sarscov2.vcf.gz
#to view contents
bcftools view variants/sarscov2.vcf.gz | less -S
We may notice that there is a little difference between the VCF file generated by bcftools
and the one generated by FreeBayes. The reason is that each program uses a different algo-
rithm for variant calling as discussed above.
4.2.2.2 GATK Variant Calling Pipeline
Genome Analysis Toolkit (GATK) [8] is a collection of command-line tools for variant dis-
covery and analysis of the sequence data acquired from the high-throughput sequencing
instruments. For installing the latest release of GATK, follow the installation instructions
available at the GATK website “https://gatk.broadinstitute.org/”. GATK uses the haplo-
type approach as FreeBayes. However, it uses advanced workflow for variant calling. The
GATK tools can be used individually or together as a complete pipeline for variant discov-
ery. GATK also provides complete workflows called GATK Best Practices for variant call-
ing tailored for specific cases. In the following, we will discuss the most commonly used
variant calling pipeline for GATK best practice. The following workflow assumes that the
acquired raw data in FASTQ format and that the quality control has been performed on the
data. The general GATK workflow consists of the following major steps: